Nighttime lights are a uniquely human phenomenon. Every night, the VIIRS sensor (Visible Infrared Imaging Radiometer Suite) captures data on this form of human emission. While we expect that this data will tell us a great deal about the people living and working beneath the lights, there are still many factors we need to evaluate to back up that expectation.
It’s unlikely that a headlamp would be seen from space, but we don’t know that for sure. We know that the random variability that events like this represent will be more impactful when fewer observations are present. Rather than try to answer these questions, it might be better to respect that the world is complex. Any automated task (like the one we are working on) will contain conditions that you are probably unaware of and couldn’t do much about, even if you did know they existed. Those assumptions, known and unknown, carry through the analysis and have the potential to skew your results in unexpected ways. Knowing that there is a lot you don’t know helps you keep your mind keen for more questions rather than set on a specific answer.
At the end of the last lesson, we computed correlations between yearly average radiance and poverty/age at the census tract level for three different counties in Texas. While the trends were not convincing, we assumed that positive confirmation of our hypothesis was still out there and that, by conducting a quality control check on the input data, we could pull out the truth.
Is it worth taking the time to run that check? Hopefully, the answer is yes, because this is an educational tutorial, after all. Yet you might feel differently if you were exploring this in a professional context. Looking into one question means you can't investigate another. It is important to be aware of the sunk cost fallacy and loss aversion as they affect your research: the more we invest in something, the less likely we are to back out.
The average monthly radiance values are delivered with an associated image layer that reports the number of daily observations that were used to generate the mean monthly value for each observation location. These values can range from 0-31, depending on the month. We want to ensure that we are only looking at observation locations where we have a high degree of confidence that the values represent a true mean.
There are three reasons why this is worth evaluating.
View Angle
The VIIRS sensor captures data in a 3,000 km swath. This means a large portion of each image is captured off-nadir. As night lights are generally directional features (think lights on the side of a building), the angle at which the image is captured will affect the radiance observed. This means that some nightly images will have lower or higher values than what would be observed at nadir. We could get around this problem by working only with on-nadir passes, but those occur only every 14 days or so. That timeframe severely limits our total number of observations and would require a whole new workflow based on the daily images. The second option is to ensure we have enough observations to average out that variability. How much is enough? We'll try to find out.
Lunar Radiance
The moon is a large, roundish rock that has influenced culture and inspired people's curiosity for as long as we've been around. We've always cared about the moon because it reflects the radiance of the sun onto the Earth. This reflectance means that we can see it on an otherwise dark night. Due to the orbital patterns of these three celestial bodies, we end up with about half of each month being darker than the other half. The effects of lunar radiance are substantial in darker areas, especially when combined with freshly fallen snow. Our counts data does not tell us anything about the dates of the images captured, but we'll cross our fingers and assume those dates are spread evenly across the month. Therefore, the more observations we get, the less we need to worry about lunar radiance making places look brighter than they are.
Cloud Cover
Clouds are like the tall, wide person who wears a fedora everywhere and decides five minutes into a concert to stand right in front of you. Simply put, they are in the way. Clouds are worse than the unsavory concertgoer because they affect the night lights data in two ways: they block and they diffuse the radiance.
If the clouds are dense, they will block the light entirely. The VIIRS data utilizes an infrared-band cloud mask, which captures these dense features. We don't get any information about night lights for those evenings in our monthly averages. In a humid place like southeast Texas, we can expect a large portion of our daily data to be obscured by clouds.
Low, thin clouds and fog can diffuse the radiance and spread it out over a larger area. This means that bright areas are dimmed, and darker areas can appear brighter. These clouds are close to the same temperature as the Earth's surface, so they are not effectively captured by the infrared-based cloud mask. It's best to assume some of this fuzziness from the clouds will be part of the monthly averages. Same as before: the more observations, the less that fuzziness will affect the average.
So how many observations are enough? I don't know. The good options for testing this assumption all require working with daily observations at each location. That is a lot more data and a lot more work. Today we're going to use a more generalized approach to visualize multiple options. We're not going to have a robust quantitative justification, but we'll have some observations that support our assumption. This limitation is okay because we know it is an assumption, and we can state that to others when they assess the work.
To tackle this question, we’re probably going to need to understand exactly what the counts data is all about. So let’s get started by setting up our environment.
# load required libraries
library(sf)
library(raster)
library(dplyr)
library(tmap)
Now we will pull the counts files.
# grab all counts images
images <- list.files(path = "data/nightLights",
pattern = "_counts.tif",
full.names = TRUE)
images
## [1] "data/nightLights/april_10arc_counts.tif"
## [2] "data/nightLights/august_10arc_counts.tif"
## [3] "data/nightLights/december_10arc_counts.tif"
## [4] "data/nightLights/feburary_10arc_counts.tif"
## [5] "data/nightLights/janurary_10arc_counts.tif"
## [6] "data/nightLights/july_10arc_counts.tif"
## [7] "data/nightLights/june_10arc_counts.tif"
## [8] "data/nightLights/march_10arc_counts.tif"
## [9] "data/nightLights/may_10arc_counts.tif"
## [10] "data/nightLights/november_10arc_counts.tif"
## [11] "data/nightLights/october_10arc_counts.tif"
## [12] "data/nightLights/september_10arc_counts.tif"
We have not processed these images at all, so they are still as big as Texas (or only as small as 40% of Alaska, however you want to look at it).
# read in image
temp1 <- raster::raster(images[1])
temp1
## class : RasterLayer
## dimensions : 650, 798, 518700 (nrow, ncol, ncell)
## resolution : 0.01666667, 0.01666667 (x, y)
## extent : -106.7271, -93.42708, 25.75208, 36.58542 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +datum=WGS84 +no_defs
## source : /Users/hayleypippin/Desktop/R_SC_Spatial/intermediateGeospatialR/data/nightLights/april_10arc_counts.tif
## names : april_10arc_counts
## values : 2, 20 (min, max)
There are over 500,000 observations in the image, and the values range from 2 to 20. Five hundred thousand elements is not that large for raster imagery, but it's more than we need to carry around for this exploratory work. We will cut this image down to the extent of a county before going further.
# grab a radiance image
allImages <- list.files(path = "data/nightLights",
pattern = ".tif",
full.names = TRUE,
recursive = TRUE)
# print to find an image from a county
allImages[1:10]
## [1] "data/nightLights/april_10arc_counts.tif"
## [2] "data/nightLights/april_10arc.tif"
## [3] "data/nightLights/august_10arc_counts.tif"
## [4] "data/nightLights/august_10arc.tif"
## [5] "data/nightLights/Bexar/april_10arc.tif"
## [6] "data/nightLights/Bexar/august_10arc.tif"
## [7] "data/nightLights/Bexar/december_10arc.tif"
## [8] "data/nightLights/Bexar/feburary_10arc.tif"
## [9] "data/nightLights/Bexar/janurary_10arc.tif"
## [10] "data/nightLights/Bexar/july_10arc.tif"
It looks like the images from Bexar County start at position 5. Read in that image and use it to crop the statewide counts data. Write out the code yourself before checking the answer.
# read in county processed image
r1 <- raster::raster(allImages[5])
# crop the raster
temp2 <- temp1 %>%
raster::crop(r1)
# plot a quick map of the cropped raster
qtm(temp2)
This cut the area of interest down significantly. If we print the object, we can see the total number of observations.
temp2
## class : RasterLayer
## dimensions : 39, 42, 1638 (nrow, ncol, ncell)
## resolution : 0.01666667, 0.01666667 (x, y)
## extent : -98.81041, -98.11041, 29.11875, 29.76875 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +datum=WGS84 +no_defs
## source : memory
## names : april_10arc_counts
## values : 2, 11 (min, max)
We’re down to ~1,600 observations, and the high end of the number of observations drops from 20 to 11.
The map shows us that the majority of the area seems to be below eight observations.
Also, note that we cropped the counts image based on the extent of the raster data from the given county. This command returns a rectangle based on the bounding box of the county raster. All this means is that some of this data will be outside of our area of interest, but since we’re just trying to learn about the counts data, that is okay.
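If you want to confirm exactly what rectangle the crop used, you can compare the extents of the two rasters directly; a quick check using the objects we already have in memory:
# bounding box of the county radiance raster (this is what crop() was given)
raster::extent(r1)
# the cropped counts raster should now report essentially the same extent
raster::extent(temp2)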
Visualize the Values
There are many different ways to summarize data in a non-spatial format. Here are a few examples.
# grab the values of the raster object
vals <- raster::values(temp2)
# summary() base R
summary(vals)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 5.000 6.000 6.192 7.000 11.000
# plot a histogram
hist(vals)
The data is pretty close to normally distributed. The mean is slightly higher than the median, so there is a bit of right skew. It’s good to see that just a limited number of locations had only two cloud-free observations in April. That’s the bare minimum needed to calculate a mean and is unlikely to account for all those potential error sources we spoke about earlier.
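If you want to put a number on "a limited number of locations," a couple of lines of base R will do it; this is just a sanity check on the same vals vector we already have:
# how many cells sat at the bare minimum of two cloud-free observations?
sum(vals == 2, na.rm = TRUE)
# and as a percentage of all cells in the cropped image
100 * mean(vals == 2, na.rm = TRUE)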
Let’s narrow this down a little more and look specifically in the county’s area by creating a mask object from our radiance raster.
# create a mask object
mask <- r1
# reassign all positive values to 1
mask[mask >= 0] <- 1
# set any value not equal to 1 to NA
mask[mask != 1] <- NA
There are probably other ways to make a mask object, but I’ve always liked this one. As the data within the raster object is effectively a matrix, you can perform indexing to reassign values as you would in a matrix. The significance of this will become more apparent as we progress, but it is super flexible and fast.
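If you prefer a more declarative route, the raster package's reclassify() function can build the same 1/NA mask in a single call. A minimal sketch, assuming the radiance values we care about are all zero or greater (right = FALSE makes the intervals left-closed, so a value of exactly 0 lands in the 1 bin):
# rows of rcl are (from, to, becomes)
rcl <- rbind(c(-Inf, 0, NA),
             c(0, Inf, 1))
mask_alt <- raster::reclassify(r1, rcl, right = FALSE)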
# multiply the rasters to apply the mask
counts <- temp2 * mask
qtm(counts)
By multiplying the images together, we keep all values within the mask area and reassign all values outside of it as NA. This is an algebraic operation between matrices, so it’s fast too.
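As an aside, raster also ships a mask() function that collapses these two steps into one: it sets cells of the first raster to NA wherever the second raster is NA. Assuming r1 is NA outside the county boundary (which is why the multiplication trick works in the first place), this sketch should produce the same result:
# one-step alternative: mask the cropped counts raster with the county radiance raster
counts_alt <- raster::mask(temp2, r1)
qtm(counts_alt)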
vals2 <- raster::values(counts)
summary(vals2)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.000 5.000 6.000 6.181 7.000 11.000 535
hist(vals2)
Changing our area of interest did not significantly affect our summary statistics, so we are working at a scale that still captures the spatial variability of the observations.
Our end goal is to use the counts data to limit what locations are used to compute the yearly averages. By doing so, we’re also limiting how much of the overall area is actually represented for a particular month. We need to balance what we keep and what we drop. We can start by seeing what proportion of the county we are reporting on if we were to filter at various levels observed in the counts data.
# pull total number of observations
vals_noNA <- vals2[!is.na(vals2)]
total <- length(vals_noNA)
# determine sequence of interest
seq1 <- seq(min(vals_noNA), max(vals_noNA), by = 1)
For each element in the sequence, we want to remove that feature from the current list and determine the change in the total elements so we can calculate a change in the area. This is a great place for a function, as we’re applying the same process for each element in the sequence.
getArea <- function(values, index){
### values: vector of numerical features
### index: numerical value to filter on
# add na clause just to be safe
values <- values[!is.na(values)]
# get total
total <- length(values)
# get new values based on index
vals_new <- values[values >= index]
# calculate the percentage of observations retained
pct <- 100 * (length(vals_new) / total)
return(pct)
}
This function performs a numerical filter on the vector of values and returns a percentage, which here represents the share of observations retained after the filter. We could translate this into an area measurement with a bit of arithmetic, but conceptually, keeping it as a percentage seems easier to work with.
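If you ever do need the area itself, the raster package can supply per-cell areas directly. A rough sketch using area(), which returns cell sizes in square kilometres for unprojected (longitude/latitude) rasters like ours:
# per-cell area in km^2, lined up with the counts values
cell_km2 <- raster::values(raster::area(counts))
county_km2 <- sum(cell_km2[!is.na(vals2)])
# e.g. a filter that keeps 90% of the cells keeps roughly this much area
0.9 * county_km2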
We need to apply the function and create a mechanism to store the output. We’ll use a data frame for this example.
# create a dataframe to store content
df <- data.frame(matrix(nrow = length(seq1), ncol = 2))
names(df) <- c("filter", "percent area")
# assign the filter element because we have it already
df$filter <- seq1
for(i in seq_along(seq1)){
# index column position using i, but define the filter value by seq1 feature
df$`percent area`[i] <- getArea(values = vals_noNA, index = seq1[i])
}
df
## filter percent area
## 1 2 100.00000000
## 2 3 99.90933817
## 3 4 98.18676337
## 4 5 90.66183137
## 5 6 68.81233001
## 6 7 41.43245694
## 7 8 15.86582049
## 8 9 2.71985494
## 9 10 0.45330916
## 10 11 0.09066183
By building out the full data frame beforehand, we are keeping the operation efficient. We needed to use the seq_along function to index the position and store the information correctly. Below is another means of accomplishing this task using a counter, which can be handy at times, specifically when you have to deal with conditional statements within your loop.
n = 1
for(i in seq1){
# index column position using i, but define the filter value by seq1 feature
df$`percent area`[n] <- getArea(values = vals_noNA, index = i)
n = n + 1
}
df
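The loop works fine, but since getArea() handles one filter value at a time, the same column can also be filled with a single sapply() call; a compact alternative, not a requirement:
# vectorized alternative to the for-loop
df$`percent area` <- sapply(seq1, function(x) getArea(values = vals_noNA, index = x))
df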
We’ve Got a Number
Based on this test case, we can see that dropping all locations with fewer than five observations (filter level 5) still gives us roughly 90% of the county to work with. The next question is: how would applying such a filter affect the amount of nighttime light observed at the county level?
This is a tricky question because, at this point, we don't know anything about where the locations we would drop actually are; we've been conducting these tests on non-spatial data. We also know that there are locations in the county that are very bright and many more that are relatively dark (urban vs. rural areas). So, at a roughly 10% reduction in area, we're probably not going to see a big change in the mean and median of all the values in the county. But there is no way of knowing without trying it out.
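If you want a quick spatial sense of which cells a given filter would drop before committing to the loop, one line with tmap will show it; the cutoff of six below is just an example value:
# TRUE where a cell has fewer than six observations, NA outside the county
qtm(counts < 6)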
We can start testing this question by bringing the spatial data back in and adapting our for-loop from above.
# create a dataframe to store content
df <- data.frame(matrix(nrow = length(seq1), ncol = 4))
### adding new columns for mean and median
names(df) <- c("filter", "percent area", "mean", "median")
# assign the filter element because we have it already
df$filter <- seq1
We’ve added new columns to our data frame to store the mean and median radiance values for the county. So far, we’ve been working with counts only, so we will need to bring the radiance image into the loop as well.
# Check to make sure the original feature we read in matches our month of interest
r1
## class : RasterLayer
## dimensions : 39, 42, 1638 (nrow, ncol, ncell)
## resolution : 0.01666667, 0.01666667 (x, y)
## extent : -98.81041, -98.11041, 29.11875, 29.76875 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +datum=WGS84 +no_defs
## source : /Users/hayleypippin/Desktop/R_SC_Spatial/intermediateGeospatialR/data/nightLights/Bexar/april_10arc.tif
## names : april_10arc
## values : 0.425001, 116.5439 (min, max)
temp1
## class : RasterLayer
## dimensions : 650, 798, 518700 (nrow, ncol, ncell)
## resolution : 0.01666667, 0.01666667 (x, y)
## extent : -106.7271, -93.42708, 25.75208, 36.58542 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +datum=WGS84 +no_defs
## source : /Users/hayleypippin/Desktop/R_SC_Spatial/intermediateGeospatialR/data/nightLights/april_10arc_counts.tif
## names : april_10arc_counts
## values : 2, 20 (min, max)
We got lucky here, but at some point we'll need a way to ensure the counts and radiance images match temporally. For now, let's figure out how to do this once. We start by drafting out the workflow.
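One way to handle that pairing later would be to lean on the month name embedded in each file name. The helper below is hypothetical, not part of the lesson's code, but it sketches the idea using the file lists we already built:
# given a county radiance file, find the statewide counts file for the same month
findCountsFile <- function(radPath, countFiles){
  # e.g. "data/nightLights/Bexar/april_10arc.tif" -> "april"
  month <- sub("_10arc.*$", "", basename(radPath))
  countFiles[grepl(paste0("^", month, "_"), basename(countFiles))]
}
findCountsFile(allImages[5], images)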
## Note: speculating on workflow, DO NOT RUN
i <- "filter level"
## create a mask of the counts layer
counts[counts < i] <- NA
counts[counts >= i] <- 1
## apply the mask to the radiance layer
rad1 <- rad * counts
## remove all NA values
rad_vals <- raster::values(rad1)
rad_vals <- rad_vals[!is.na(rad_vals)]
## calculate mean and median
df$mean <- mean(rad_vals)
df$median <- median(rad_vals)
Alright, so we will need to clip and mask the monthly radiance images by the counts raster for each month. Then we can apply the same methods we've used before to derive the mean and median radiance at the county level. As we can see, this workflow has clear inputs and outputs, so we can build it into a function.
radMeanAndMedian <- function(countRaster, radianceRaster, index){
## create a mask of the counts layer
countRaster[countRaster < index] <- NA
countRaster[countRaster >= index] <- 1
## apply the mask to the radiance layer
rad1 <- radianceRaster * countRaster
## remove all NA values
rad_vals <- raster::values(rad1)
rad_vals <- rad_vals[!is.na(rad_vals)]
## create a vector to store outputs
values <- c()
## calculate mean and median
values[1] <- mean(rad_vals)
values[2] <- median(rad_vals)
return(values)
}
We had to change where we are storing the mean and median values. We could return two objects, but I think it’s better to output a single feature and use indexing to access the data. With this new function in hand, let’s adjust our existing workflow to fit around it.
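If indexing by position ever feels fragile, one small variation (purely optional) is to name the elements of the returned vector so the calling code can ask for them by name instead of by slot:
# stand-in values, just to illustrate the named-vector return
rad_vals <- c(4.5, 5.0, 6.2)
out <- c(mean = mean(rad_vals), median = median(rad_vals))
out["mean"]
out["median"]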
# define input parameters
count_rastula <- counts
rad_rast <- raster::raster(allImages[5])
# determine sequence of filters
count_vals <- raster::values(count_rastula)
vals_noNA <- count_vals[!is.na(count_vals)]
seq1 <- seq(min(vals_noNA), max(vals_noNA), by = 1)
# loop over filter values
for(i in seq_along(seq1)){
# run the area function
df$`percent area`[i] <- getArea(values = vals_noNA, index = seq1[i])
# run the mean median function
meanMedian <- radMeanAndMedian(countRaster = count_rastula,
radianceRaster = rad_rast,
index = seq1[i])
# a vector is returned with mean and median values, index to assign it to the correct positions
df[i,3:4] <- meanMedian
}
df
## filter percent area mean median
## 1 2 100.00000000 11.679998 4.518006
## 2 3 99.90933817 11.669617 4.515014
## 3 4 98.18676337 11.789309 4.588748
## 4 5 90.66183137 12.113821 4.980951
## 5 6 68.81233001 11.341239 4.686239
## 6 7 41.43245694 11.882842 5.769488
## 7 8 15.86582049 13.084229 6.895073
## 8 9 2.71985494 18.566201 13.547457
## 9 10 0.45330916 19.168742 3.585171
## 10 11 0.09066183 3.585171 3.585171
This looks great, and we can see some more significant changes occurring between filter levels 6 and 7. Let's take a second to visualize these results in a graphic.
We're going to be using a new library here called plotly. What makes plotly stand out relative to ggplot2 is its ability to create interactive figures. This becomes particularly valuable when you're utilizing Rmd documents to generate reports of your results or when you have a lot of information to show on a single graphic. Interactivity allows you to be a bit less particular with your visualization parameters.
# install and load package
# install.packages("plotly")
library(plotly)
### Plot a figure
p1 <- plot_ly()
p1
plotly functions utilize the dplyr piping structure rather than the + operator used by ggplot2. Both allow you to add to existing objects, which means we can start with a blank figure.
p2 <- p1 %>%
  add_trace(x = df$filter, y = df$`percent area`, type = 'scatter')
p2
The add_trace function allows us to add elements to the figure. In this case, we plot filter levels on the x-axis and percent area on the y-axis, with the type defined as scatter.
While this is discrete data and points are the correct means of displaying it, we can add a line to this feature to visualize the trend more clearly.
p3 <- p2 %>%
  add_trace(x = df$filter, y = df$`percent area`, type = 'scatter', line = list(dash = 'dash', shape = "spline"))
p3
Notice that a legend was added now that there are two features. Since the two features are the same, we simply don’t see one of them. The order in which we add features to the plotly object determines the visual hierarchy.
We don't need two sets of the same data on the plot, so let's recreate this from the start and add some text to label the axes.
p1 <- plot_ly() %>%
  add_trace(x = df$filter, y = df$`percent area`, type = 'scatter', line = list(dash = 'dash', shape = "spline")) %>%
  layout(xaxis = list(title = "Filter Level"),
         yaxis = list(title = "Percentage of Coverage"))
p1
Note we can add the parameters we used in the add_trace call as parameters in the original plot_ly function. This method is more standard. add_trace is generally used to add multiple elements to a plot. The end result is the same.
# mean plot
p2 <- plot_ly(x = df$filter, y = df$mean, type = 'scatter', line = list(dash = 'dash', shape = "spline")) %>%
  layout(xaxis = list(title = "Filter Level"),
         yaxis = list(title = "Mean"))
# median plot
p3 <- plot_ly() %>%
  add_trace(x = df$filter, y = df$median, type = 'scatter', line = list(dash = 'dash', shape = "spline")) %>%
  layout(xaxis = list(title = "Filter Level"),
         yaxis = list(title = "Median"))
p2
p3
We can combine all three figures into a single graphic using the plotly function called subplot.
p <- plotly::subplot(p1, p2, p3, nrows = 3, shareX = TRUE, titleY = TRUE)
p
The figures are stacked vertically by setting nrows = 3. This method works particularly well because all plots share the same x-axis, making for a nice, compact figure. The titleY parameter carries the y-axis titles from the original plots through to the final figure.
With this visualization, we start to see that even though we lose a lot of area at filter level 6, the mean and median for the county remain consistent. This means that we are still capturing the general character of the nighttime lights at the county-level spatial scale. Retaining 68% of the county could be a fair sample of the county as a whole.
In the ongoing project on which this lesson is based, the group determined that 10 observations per month would be required to assume a quality monthly signal. This meant that we had to transfer our area of interest from Houston to the more arid cities of Las Vegas, Phoenix, and Tucson. There is always a balance in these choices, and as long as you can justify why you made the decision, feel confident in rolling with it.
At this point, we have a process developed for generating a data-rich visualization that shows how filtering the radiance data based on the number of observations changes the average radiance observed at the county level. This product is best suited for aiding a discussion around the quantity and quality of observations. So far, we've made it work once, on one month, for one county. If we wanted to be comprehensive in our assessment, we would need to apply this process across every month and every county of interest.
We could structure this within a series of loops, but evaluating the results at the county level is probably more appropriate; it's a concrete spatial scale that much of our current analysis is built on. As the results we want to show gain utility from interactivity, we can produce them via an .Rmd rendered to HTML rather than a plain R script. A bonus of the .Rmd process is that we can call the document directly from an R script, in a similar way to how we call a function. This is a lot to try to visualize, so let's jump into it and discuss the details as we move through the section 3 material.
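As a taste of what that looks like, rendering the report from a script is a single rmarkdown::render() call. The file name and the county parameter below are placeholders; they only work if the .Rmd exists at that path and declares a matching params entry in its YAML header:
# render the county summary report from a plain R script (hypothetical path and params)
rmarkdown::render(input = "src/section3/countySummaries.Rmd",
                  output_format = "html_document",
                  params = list(county = "Bexar"))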
Open a new RMD script and the countySummaries.html from the src/section3 folder.